Solution to DBS Technical Assessment

Predict loan defaults by leveraging XGBoost. The structure of the notebook is as follows:

  1. Read the data and some basic preprocessing.
  2. Some EDA plots and a report by leveraging the pandas profiling library.
  3. One hot encoding of categorical variables and dropping highly correlated and constant columns.
  4. Model training using the XGBoost library. I have employed XGBoost since it almost always outperforms other ML models, such as logistic regression and random forests, on tabular binary classification tasks such as this. Furthermore, it elegantly handles missing values, if any.
  5. Report the metrics on my validation set and generate a feature importance plot.
  6. Prepare the submission and store the submission to CSV file.
In [45]:
import sys
# data manipulation
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
import numpy as np

# autoreload magic for developing local packages
%load_ext autoreload
%autoreload 2

# plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # higher resolution plots

import seaborn as sns
from pandas_profiling import ProfileReport
sns.set_context("poster")
sns.set(rc={'figure.figsize': (8, 5.)})
sns.set_style("whitegrid")
import chart_studio.plotly as py
import plotly.express as px
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [2]:
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import average_precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn import metrics
from sklearn.model_selection import train_test_split
import xgboost as xgb

Read and preprocess train and test data

In [3]:
def read_and_preprocess_file(file_location):
    data = pd.read_csv(file_location)
    data["loan_amount"] = (data.loan_amount.str.replace("$", "", regex=False)
                           .str.replace(",", "", regex=False).astype(float))
    data["insured_amount"] = (data.insured_amount.str.replace("$", "", regex=False)
                              .str.replace(",", "", regex=False).astype(float))
    data["insured_minus_loan"] = data["insured_amount"] - data["loan_amount"]
    dd_mmm_yy = data.request_date.str.split("-", expand=True)
    dd_mmm_yy[2] = "20"+dd_mmm_yy[2]
    data["request_date"] = dd_mmm_yy[2]+"-"+dd_mmm_yy[1]+"-"+dd_mmm_yy[0]
    data["request_date"] = pd.to_datetime(
        data.request_date, format="%Y-%b-%d")
    data['request_year'] = pd.DatetimeIndex(data['request_date']).year
    data['request_year'] = data.request_year.astype('category')
    data['business_type'] = data.business_type.astype('category')
    data['request_month'] = pd.DatetimeIndex(data['request_date']).month
    return data
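The currency and date cleaning above can be exercised on a tiny made-up frame mirroring the raw file layout (the values below are illustrative, not from the assessment data). Note `regex=False`, which treats `$` literally; under the regex default, `$` matches end-of-string and the replace is a no-op.

```python
import pandas as pd

# Minimal sample with the same raw formatting as the CSV columns.
raw = pd.DataFrame({
    "loan_amount": ["$35,000", "$15,000"],
    "insured_amount": ["$35,000", "$13,500"],
    "request_date": ["27-Apr-10", "05-Nov-09"],
})

# Strip currency formatting, then cast to float.
for col in ["loan_amount", "insured_amount"]:
    raw[col] = (raw[col].str.replace("$", "", regex=False)
                        .str.replace(",", "", regex=False)
                        .astype(float))

# Expand the 2-digit year and parse dd-mmm-yy into a datetime.
parts = raw["request_date"].str.split("-", expand=True)
raw["request_date"] = pd.to_datetime(
    "20" + parts[2] + "-" + parts[1] + "-" + parts[0], format="%Y-%b-%d")

print(raw["loan_amount"].tolist())           # [35000.0, 15000.0]
print(raw["request_date"].dt.year.tolist())  # [2010, 2009]
```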
In [4]:
train_data = read_and_preprocess_file("data/train.csv")
test_data = read_and_preprocess_file("data/test.csv")
train_data.shape, test_data.shape
Out[4]:
((2402, 16), (601, 15))
In [5]:
train_data.head()
Out[5]:
id industry state request_date term employee_count business_new business_type location other_loans loan_amount insured_amount default_status insured_minus_loan request_year request_month
0 4050975007 Others VA 2010-04-27 34 4 New 0 Rural N 35000.0 35000.0 1 0.0 2010 4
1 3735095001 Manufacturing CA 2009-11-05 107 1 New 0 Rural N 15000.0 13500.0 1 -1500.0 2009 11
2 3936555004 Trading CA 2010-02-26 84 1 New 0 Rural Y 265000.0 100000.0 0 -165000.0 2010 2
3 4130405000 Engineering MI 2010-06-10 240 21 New 0 Rural N 255000.0 255000.0 0 0.0 2010 6
4 4263615008 Education NH 2010-09-23 36 1 Existing 0 Rural N 13300.0 6650.0 0 -6650.0 2010 9

Leverage pandas profiling library for EDA

This library generates all the relevant EDA plots and correlations for the relevant columns. Given that this is a small dataset, I took advantage of the library. I have also saved the report to the file visualization_abhinav.html, which is easier to read than the widget below. Do refer to the HTML report as well.

In [6]:
profile = ProfileReport(train_data.drop("id", axis=1), title="Predict Loan Defaults Report")
profile.to_widgets()


In [7]:
profile.to_file(output_file='visualization_abhinav.html')


Analysis of the yearly and monthly variation in loan defaults.

I use the year and month information since both the train and test datasets come from the same time period.

  • A greater percentage of loans defaulted in 2009 than in 2010. Could we attribute this to the 2008 financial crisis?
  • We can also infer from the data that some industries, such as construction and hotels, are more likely to default than others, such as healthcare and manufacturing.
In [8]:
fig, ax = plt.subplots(1, 2, figsize=(18, 5))
sns.countplot(x='request_year', hue='default_status',
              data=train_data, ax=ax[0]).set_title("Loan defaults by year")
sns.countplot(x='request_month', hue='default_status',
              data=train_data, ax=ax[1]).set_title('Loan defaults by month');

Analysis of the impact of other categorical variables on loan default

In [9]:
fig, ax = plt.subplots( 1,2 , figsize=(18,5))
sns.countplot(x='other_loans', hue='default_status',
              data=train_data, ax=ax[0]).set_title('Loan defaults by whether customer has other loans');
sns.countplot(x='business_type', hue='default_status',
              data=train_data, ax = ax[1]).set_title('Loan defaults by business type');
In [10]:
fig, ax = plt.subplots( 1,2 , figsize=(18,5))
sns.countplot(x='business_new', hue='default_status',
              data=train_data, ax = ax[0]).set_title('Loan defaults by business new or old')
sns.countplot(x='industry', hue='default_status',
              data=train_data, ax = ax[1]).set_title('Loan defaults by industry')
plt.xticks(rotation=90);

Addressing the state in US where loan was issued

A few points to note:

  • I remove the 3 states present in the train set but not in the test set.
  • Different states have different loan default percentages (based on the interactive choropleth plot below), which makes state an important feature.
In [11]:
not_in_test = list(set(train_data.state.unique()) - set(test_data.state.unique()))
not_in_test
Out[11]:
['RI', 'WY', 'WV']
In [12]:
train_data = train_data.loc[~train_data.state.isin(not_in_test)]
set(train_data.state.unique()) - set(test_data.state.unique())
Out[12]:
set()
In [13]:
def get_default_percentage(group):
    num_default= len(group[group.default_status==1])
    return pd.Series({"default_percentage": num_default*100./len(group)})
    
state_df = train_data.groupby("state").apply(get_default_percentage)
state_df["state"] = state_df.index
state_df.reset_index(inplace=True, drop=True)
state_df.head()
Out[13]:
default_percentage state
0 0.000000 AK
1 35.000000 AL
2 54.545455 AR
3 61.818182 AZ
4 35.836177 CA
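The per-state aggregation in `get_default_percentage` can equivalently be written as a vectorized groupby: averaging the 0/1 target per state and scaling by 100. A minimal sketch on made-up rows (not the assessment data):

```python
import pandas as pd

# Hypothetical mini-frame with the same two columns used above.
df = pd.DataFrame({
    "state": ["AL", "AL", "AL", "CA", "CA"],
    "default_status": [1, 0, 0, 1, 1],
})
# Mean of a 0/1 column is the default rate; times 100 gives a percentage.
pct = df.groupby("state")["default_status"].mean().mul(100)
print(round(pct["AL"], 2), pct["CA"])  # 33.33 100.0
```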
In [14]:
fig = px.choropleth(state_df,  # Input Pandas DataFrame
                    locations="state",  # DataFrame column with locations
                    color="default_percentage",  # DataFrame column with color values
                    hover_name="state", # DataFrame column hover info
                    locationmode = 'USA-states') # Set to plot as US States
fig.update_layout(
    title_text = 'State by loan default percentage', # Create a Title
    geo_scope='usa',  # Plot only the USA instead of globe
)
fig.show()  # Output the plot to the screen

Drop some columns

  • I am dropping the insured_amount column since it is highly correlated with loan_amount.
  • The column insured_minus_loan captures the difference between the insured and loan amounts. Given its positive correlation with default_status, the greater the difference between the insured and loan amounts, the higher the likelihood of default.
  • The column location is constant, hence I drop it.
In [15]:
train_data[["loan_amount","insured_amount","insured_minus_loan", "default_status"]].corr()
Out[15]:
loan_amount insured_amount insured_minus_loan default_status
loan_amount 1.000000 0.949561 -0.578190 -0.247189
insured_amount 0.949561 1.000000 -0.293174 -0.223572
insured_minus_loan -0.578190 -0.293174 1.000000 0.171929
default_status -0.247189 -0.223572 0.171929 1.000000
In [16]:
train_data.drop(["location","request_date", "insured_amount"], axis=1, inplace=True)
test_data.drop(["location","request_date","insured_amount"], axis=1, inplace=True)

One hot encoding of categorical variables

In [17]:
cat_columns = []
for col in train_data.columns:
    if col not in ('id', 'default_status'):
        dtype = str(train_data.dtypes[col])
        if "float" not in dtype and "int" not in dtype:
            cat_columns.append(col)
cat_columns
Out[17]:
['industry',
 'state',
 'business_new',
 'business_type',
 'other_loans',
 'request_year']
In [18]:
train_data = pd.get_dummies(train_data, columns=cat_columns)
test_data = pd.get_dummies(test_data, columns=cat_columns)
train_data.shape, test_data.shape
Out[18]:
((2376, 78), (601, 77))
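One caveat with calling `get_dummies` on train and test separately: a category present in only one frame yields mismatched columns. This notebook guards against that by sorting columns and checking shapes; an alternative sketch (on made-up frames, not the assessment data) uses `DataFrame.align` to fill absent dummies with 0:

```python
import pandas as pd

# Hypothetical frames where one category appears in only one side.
train = pd.get_dummies(pd.DataFrame({"industry": ["Trading", "Education"]}))
test = pd.get_dummies(pd.DataFrame({"industry": ["Trading", "Hotels"]}))
print(train.shape[1], test.shape[1])  # 2 2 -- but different column names

# Outer-align on the column axis, filling missing dummies with 0.
train, test = train.align(test, join="outer", axis=1, fill_value=0)
print(list(train.columns) == list(test.columns))  # True
```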

Split the train data into train and valid

Stratify using the dependent variable default_status

In [19]:
train, valid = train_test_split(train_data, test_size=0.2, random_state=42,
                                          stratify=train_data.default_status)
print(train.shape, valid.shape)
(1900, 78) (476, 78)
In [20]:
default_status_train = train.default_status
id_train = train.id
train = train[train.columns.difference(['id', 'default_status'])]
train = train.reindex(sorted(train.columns), axis=1)

X_train = np.array(train)
y_train = np.array(default_status_train.astype('category'))
print(X_train.shape, y_train.shape)
(1900, 76) (1900,)
In [21]:
default_status_valid = valid.default_status
id_valid = valid.id
valid = valid[valid.columns.
              difference(['id', 'default_status'])]
valid = valid.reindex(sorted(valid.columns), axis=1)
X_valid = np.array(valid)
y_valid = np.array(default_status_valid.astype('category'))
print(X_valid.shape, y_valid.shape)
(476, 76) (476,)

XGBoost model training with grid search CV

Use grid search CV to determine the best model parameters.

Determine the complexity of the XGBoost model

  • Determine the number of estimators (trees) and the maximum depth of the trees.
In [22]:
max_depth = [4, 6, 8]
n_estimators = [60, 70, 80, 90, 100]
In [46]:
params = {
    'scale_pos_weight': [1.],
    # 'early_stopping_rounds': [8],
    'max_depth':max_depth,
    'subsample': [.8],
    'colsample_bytree': [.8],
    'n_estimators': n_estimators,
    'learning_rate': [0.05],
    'nthread': [5],
    'tree_method': ['hist']
}

xgb_classifier = xgb.XGBClassifier()
rs = GridSearchCV(xgb_classifier, params, cv=3,
                  n_jobs=1, scoring='f1', verbose=1)
In [47]:
grid_result = rs.fit(X_train, y_train)
gbm = rs.best_estimator_
Fitting 3 folds for each of 15 candidates, totalling 45 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:    3.0s finished
In [48]:
print(f"best score GBM: {rs.best_score_}\n"
                          f"best params GBM: {rs.best_params_}")
best score GBM: 0.8490457271948276
best params GBM: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 80, 'nthread': 5, 'scale_pos_weight': 1.0, 'subsample': 0.8, 'tree_method': 'hist'}
In [49]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = rs.cv_results_['mean_test_score']
stds = rs.cv_results_['std_test_score']
params = rs.cv_results_['params']
# plot results
scores = np.array(means).reshape(len(max_depth), len(n_estimators))
f = plt.figure(figsize=(10, 5))
for i, value in enumerate(max_depth):
    plt.plot(n_estimators, scores[i], label='depth: ' + str(value),  linewidth=3)
plt.legend()
plt.xlabel('n_estimators')
plt.ylabel('F1 score');
Best: 0.849046 using {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 80, 'nthread': 5, 'scale_pos_weight': 1.0, 'subsample': 0.8, 'tree_method': 'hist'}

Determine whether scale_pos_weight and regularization (gamma) are required

The scale_pos_weight parameter addresses the slight class imbalance by weighting errors on misclassified defaults roughly 2 times as much as errors on misclassified non-defaults.

Choose max_depth as 6 and n_estimators as 80 based on the plot above.
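The identity behind the scale_pos_weight computation in the next cell: for a 0/1 label vector of length N, N/pos − 1 equals neg/pos, the standard heuristic for this parameter. A quick check on toy labels (not the real y_train):

```python
import numpy as np

# Toy labels: 2 positives, 4 negatives.
y = np.array([1, 0, 0, 1, 0, 0])
pos, neg = int((y == 1).sum()), int((y == 0).sum())
spw = y.shape[0] / pos - 1.0  # same formula as in the notebook
print(spw, neg / pos)  # 2.0 2.0
```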

In [27]:
scale_pos_weight = y_train.shape[0]/np.sum(y_train == 1)-1.0
scale_pos_weight
Out[27]:
2.0944625407166124
In [28]:
params = {
    'scale_pos_weight': [1., scale_pos_weight],
    'gamma':[1., 2., 0.],
    'max_depth':[6],
    'subsample': [.8],
    'colsample_bytree': [.8],
    'n_estimators': [80],
    'learning_rate': [0.05],
    'nthread': [5],
    'tree_method': ['hist']
}

xgb_classifier = xgb.XGBClassifier()
rs = GridSearchCV(xgb_classifier, params, cv=3,
                  n_jobs=1, scoring='f1', verbose=1)
In [29]:
grid_result = rs.fit(X_train, y_train)
gbm = rs.best_estimator_
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    1.3s finished
In [30]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Best: 0.849046 using {'colsample_bytree': 0.8, 'gamma': 0.0, 'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 80, 'nthread': 5, 'scale_pos_weight': 1.0, 'subsample': 0.8, 'tree_method': 'hist'}

Report and plot the relevant binary classification metrics and the feature importances

The metrics computed on the validation set are:

  1. Accuracy
  2. AUC
  3. AUC PR (area under the precision recall curve)
  4. F1 score
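The four metrics can be illustrated on a handful of toy scores (made-up values, not the validation set), using the same sklearn calls as the function below:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, auc, f1_score,
                             precision_recall_curve, roc_auc_score)

# Toy targets and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 decision threshold

print(accuracy_score(y_true, y_pred))        # 0.75
print(roc_auc_score(y_true, y_prob))         # 0.75
precision, recall, _ = precision_recall_curve(y_true, y_prob)
print(round(auc(recall, precision), 3))      # area under the PR curve
print(round(f1_score(y_true, y_pred), 3))    # 0.667
```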
In [31]:
"""
Compute binary classification metrics
"""


def compute_validation_acc(model, features_test, target):
    predict_proba = model.predict_proba(features_test)
    predicted = model.predict(features_test)
    pred_prob = np.array(predict_proba[:, 1])
    area_under_curve = roc_auc_score(target, pred_prob)
    labels = ['non default', 'defaults']
    print("accuracy score:", accuracy_score(target, predicted))
    print(target.shape, np.sum(predicted))
    cm = confusion_matrix(target, predicted)
    print(cm)

    sns.set(rc={'figure.figsize': (18, 5)})
    fig, ax = plt.subplots(1, 3)
    fig.suptitle("Metrics for binary classification", fontsize=16)

    # confusion matrix
    akws = {"ha": 'center', "va": 'center', "size": 17}
    # annot=True to annotate cells
    sns.heatmap(cm, annot=True, ax=ax[0], annot_kws=akws)
    ax[0].set_xlabel('Predicted labels')
    ax[0].set_ylabel('True labels')
    ax[0].set_title('Confusion Matrix')
    ax[0].xaxis.set_ticklabels(labels)
    ax[0].yaxis.set_ticklabels(labels)

    # ROC
    fpr, tpr, _ = metrics.roc_curve(target,  pred_prob)
    ax[1].plot(fpr, tpr, label=f"auc = {area_under_curve:.3f}")
    ax[1].set_xlabel('False positive rate')
    ax[1].set_title('ROC curve')
    ax[1].set_ylabel('True positive rate')
    ax[1].legend(loc="lower right")

    # precision recall
    precision, recall, thresholds = precision_recall_curve(target, pred_prob)
    f1 = f1_score(target, predicted)
    auc_pr = auc(recall, precision)
    ap = average_precision_score(target, pred_prob)
    print('f1=%.3f auc_pr=%.3f avg_pr=%.3f auc=%.3f' %
          (f1, auc_pr, ap, area_under_curve))

    ax[2].plot(recall, precision, label=f"auc_pr = {auc_pr:.3f}")
    ax[2].set_title('PR Curve')
    ax[2].set_xlabel('Recall')
    ax[2].set_ylabel('Precision')
    ax[2].legend(loc="upper right")
In [32]:
compute_validation_acc(gbm, X_valid,y_valid.astype(int))
accuracy score: 0.9054621848739496
(476,) 131
[[311  11]
 [ 34 120]]
f1=0.842 auc_pr=0.908 avg_pr=0.908 auc=0.944
In [33]:
sorted_idx = np.argsort(gbm.feature_importances_)[::-1]
feature_importance = {}
for index in sorted_idx:
    feature_importance[train.columns[index]
                       ] = gbm.feature_importances_[index]

n_features = X_train.shape[1]
ss = sorted(feature_importance, key=feature_importance.get, reverse=True)
top_names = ss[0:]
f = plt.figure(figsize=(22, 6))
plt.grid(True)
plt.yticks(fontsize=18)
plt.xticks(range(n_features), top_names, rotation=90, fontsize=18)
plt.title("Feature importances", fontsize=22)
plt.bar(range(n_features), [feature_importance[i]
                            for i in top_names], color="#ff471a", align="center")
plt.xlim(-1, n_features)
plt.ylabel('Relative feature importance', fontsize=20);

Prepare the test data for submission

Generate the submissions_abhinav.csv file as well.

In [34]:
id_test = test_data.id
test_data = test_data.drop(['id'], axis=1)
test_data = test_data.reindex(sorted(test_data.columns), axis=1)
X_test = np.array(test_data)
print(X_test.shape)
(601, 76)
In [35]:
if list(train.columns) == list(test_data.columns):
    predicted = gbm.predict(X_test)
    submission = pd.DataFrame(
        {"id": list(id_test), "default_status": list(predicted)})
    print(submission.default_status.value_counts())
    submission.to_csv("submissions_abhinav.csv", index=False)
0    403
1    198
Name: default_status, dtype: int64
In [36]:
submission.head()
Out[36]:
id default_status
0 3999155010 0
1 4035035009 0
2 3889475000 0
3 3794845001 0
4 4163475006 0